detect: add chelae detect subcommand for adapter identification#8
Conversation
`chelae detect` samples reads from one or two FASTQ files and identifies which 3' adapter sequence(s) are present. Two modes: - Paired-end: runs the same overlap probe as `chelae trim` with an empty adapter-evidence library so every probe-accepted shift becomes a candidate detection, harvests the post-template tails on each mate, and builds a position-by-position consensus across reads that landed in the same 16-bp k-mer cluster. The consensus extends as long as per- position coverage stays above an adaptive floor. - Single-end: scores each read against every built-in kit adapter plus any user `--adapter-sequence` / `--adapter-fasta` candidate via the same scanner trim uses. Reports are split into two sections: matched kit(s) with their published sequences, then the full-length discovered consensus with uppercase marking the kit-stable region and lowercase marking the per-sample extension (typically the i7/i5 barcode). Optional `--output-fasta` writes a cross-sample-portable artifact: kit-matched hits emit the kit's published adapter sequence (not the sample-specific consensus), so the FASTA can be fed back into `chelae trim --adapter-fasta` and applied across an entire batch without barcode bleed-through breaking the per-sample match. Shared infrastructure with trim: BUFFER_SIZE and open_fastq_inputs lifted into commands/utils.rs; trim helpers (Adapter, OverlapStats, detect_pe_overlap, find_adapter_3prime, etc.) made pub(crate) for reuse; new load_adapter_fasta_with_names so user-curated FASTA record names survive into the detect output. Defaults are tuned for the discovery use case rather than trim's: --num-detections=5000, --min-detections-for-report=20 floor to refuse reporting on too-small samples, --adapter-min-length=10 (bumped from trim's 6 because detect scores 8+ candidates per read, raising the per-position random-match floor). README updated with examples + options table for the new subcommand.
Three follow-ups from the opus/sonnet/haiku adversarial pass: - Validate `--min-detections-for-report <= --max-reads` (previously validated only against `--num-detections`; a config like `--num-detections 200000 --max-reads 10000 --min-detections-for-report 50000` would pass validation and then guaranteed-fail mid-run). - `KIT_NEAR_MATCH_HAMMING` 2 -> 1. The FASTA writer substitutes the kit's published sequence in for any kit match (exact OR fuzzy), and 2 mm over 16 bp was too loose a bar for that substitution. A novel adapter sharing 14/16 bp with a kit prefix should not be silently relabeled as that kit; real sequencing-error fuzzy matches almost always sit at 1 mm given the per-base error rate. The ALL_KITS members are pairwise far enough that this doesn't cost legitimate matches. - `TailAccumulator::base_counts` widened to `[u64; 5]`. Defensive against pathological `--num-detections` values; theoretical-only at the default 5000. Plus a new substantive feature in response to the same review: apply the 3'-end cleanups from `chelae trim` (poly-G / poly-X / cut-right quality) to each read before it reaches the overlap probe or the candidate scan. detect's specificity bias justifies more aggressive defaults than trim's sensitivity-leaning ones: - `--trim-polyg N` default 10 (matches trim); `0` disables. - `--trim-polyx N` default 5 (down from trim's opt-in default of 10); `0` disables. - `--quality-trim W:Q` default `4:20` (smaller window than trim's `8:20` -> more aggressive); pass `off`/`none` to disable. Cut-right semantics — truncates at the start of the first failing window. Reuses the existing SIMD kernels via two new `pub(crate)` helpers: `find_polyx_tail_len` (already pure) and a freshly-extracted `cut_right_quality_position` (split out of `trim_quality_sliding_5prime` so detect doesn't need an `OwnedRecord`). Two of sonnet's findings were ruled out as non-bugs: the PE `insert >= r1_seq.len() || insert >= r2_seq.len()` skip on asymmetric reads is a sensitivity miss not a correctness bug (and matches the docstring contract); SE FASTA emitting the user's `--adapter-sequence` verbatim instead of substituting the kit canonical is the right behavior because the user explicitly asked us to score against that sequence. README options table updated. Smoke test against synthetic TruSeq + polyG dropoff + Q0 tail confirms the trims fire and the FASTA artifact stays kit-canonical.
- detect.rs: align with CLAUDE.md module-ordering convention. Move `QualityTrimSetting` (a helper enum for `Detect.quality_trim`) below `impl Command for Detect` since "higher-to-lower level" puts the CLI struct first. Move `BASES` from a body-level location up with the other top-of-file constants. Move `AnnotatedHit` up next to the other helper structs (Hit/Candidate/KitMatch/KitMate). Restore a docstring on `push_tail_kmer` that was clipped during a prior reorder. - CHANGELOG: split the previously-stale `[Unreleased]` block into a proper `[0.1.0] - 2026-05-13` entry (the v0.1.0 release), and rewrite `[Unreleased]` as a final-state diff vs v0.1.0 — not a chronological log of intermediate commits. The new entry covers the `chelae detect` subcommand, the `chelae trim --expected-insert-size` fix, and the shared `commands/utils.rs` refactor. No behavior change.
📝 Walkthrough📝 Walkthrough🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
src/bin/commands/detect.rs (1)
792-815: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winKeep
TailAccumulator’s impl adjacent to the struct.
base_to_slotsplits the struct from its inherent impl. Move the helper below the impl, or make it an associated helper.As per coding guidelines, command modules require “structs/enums with immediately-following impl blocks.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/bin/commands/detect.rs` around lines 792 - 815, `TailAccumulator` is separated from its inherent impl by the `base_to_slot` helper, violating the command-module rule that structs/enums must be immediately followed by their impl blocks. Move `base_to_slot` below `impl TailAccumulator`, or convert it into an associated helper on `TailAccumulator`, so the struct and its impl stay adjacent.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/bin/commands/detect.rs`:
- Around line 862-875: Remove the stale rustdoc about push_tail_kmer from the
effective_trimmed_len documentation block in detect.rs. Keep only the comments
that describe effective_trimmed_len’s 3' cleanup order and returned effective
length, and delete the bucket/k-mer insertion text so the docs match the
function name and behavior.
- Around line 1123-1163: The final human-readable report emitted by
`emit_pe_report` and `emit_se_report` is still going through `info!`, so it is
landing on stderr instead of stdout. Update the report path in these functions
to write explicitly to stdout (for example via a stdout writer/print-style
output), while leaving non-report progress logging on `info!`. Use the existing
symbols `emit_pe_report`, `emit_se_report`, `emit_matched_kits_section`, and
`emit_full_length_rows` to keep the report assembly unchanged apart from the
output target.
- Around line 355-364: The paired-read loop in detect::process_paired_reads
handles R2 EOF as an error but lets R1 EOF break silently, so mismatched inputs
can be accepted. Update the reader1/reader2 matching logic so EOF on reader1
also checks whether reader2 still has data and returns an out-of-sync error when
R2 is longer, mirroring the existing R2 exhausted before R1 path. Use the
reader1.next(), reader2.next(), and process_paired_reads match block to locate
the fix.
In `@src/bin/commands/utils.rs`:
- Around line 17-34: The `open_fastq_inputs` helper is using
`Io::new(paths.len().max(1) as u32, BUFFER_SIZE)` as if the first argument
controlled reader count, but it actually sets the compression level. Update
`open_fastq_inputs` in `utils.rs` to pass a fixed compression level consistent
with the rest of the codebase (as seen in `load_adapter_fasta_with_names`), and
revise the surrounding doc comment to describe the real meaning of
`Io::new(level, buffer_size)` instead of mentioning a thread pool or input
count.
---
Nitpick comments:
In `@src/bin/commands/detect.rs`:
- Around line 792-815: `TailAccumulator` is separated from its inherent impl by
the `base_to_slot` helper, violating the command-module rule that structs/enums
must be immediately followed by their impl blocks. Move `base_to_slot` below
`impl TailAccumulator`, or convert it into an associated helper on
`TailAccumulator`, so the struct and its impl stay adjacent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5694ef6d-25c7-4fa1-b0dc-d8b0289c434c
📒 Files selected for processing (8)
CHANGELOG.mdCLAUDE.mdREADME.mdsrc/bin/commands/detect.rssrc/bin/commands/mod.rssrc/bin/commands/trim.rssrc/bin/commands/utils.rssrc/bin/main.rs
| /// Emits the human-readable PE report on stdout via `info!`. Two sections: | ||
| /// 1. Matched kit(s): one line per mate per kit match, with the kit's full | ||
| /// published adapter sequence. | ||
| /// 2. Full-length discovered consensus: per-mate, the consensus as observed | ||
| /// in the data, with uppercase marking the kit-matched prefix and | ||
| /// lowercase marking the extension past it (typically the i7 barcode). | ||
| fn emit_pe_report( | ||
| detections: u64, | ||
| reads_scanned: u64, | ||
| overlap_hits: u64, | ||
| r1: &[AnnotatedHit<'_>], | ||
| r2: &[AnnotatedHit<'_>], | ||
| ) { | ||
| info!("chelae detect complete (paired-end):"); | ||
| info!( | ||
| " scanned: {} pair(s) ({} overlap hits, {} usable detections)", | ||
| fmt_count(reads_scanned), | ||
| fmt_count(overlap_hits), | ||
| fmt_count(detections), | ||
| ); | ||
| emit_matched_kits_section(r1.iter().chain(r2.iter())); | ||
| info!(" Full-length discovered consensus (uppercase = matches kit, lowercase = extension):"); | ||
| emit_full_length_rows(r1, detections); | ||
| emit_full_length_rows(r2, detections); | ||
| } | ||
|
|
||
| /// Emits the human-readable SE report on stdout via `info!`. Same two-section | ||
| /// layout as PE, with the section-2 sequences rendered through the same | ||
| /// uppercase/lowercase formatter (no-op for SE because the candidate's | ||
| /// sequence and the matched kit's sequence are typically identical). | ||
| fn emit_se_report(detections: u64, reads_scanned: u64, reported: &[AnnotatedHit<'_>]) { | ||
| info!("chelae detect complete (single-end):"); | ||
| info!( | ||
| " scanned: {} read(s); {} matched a candidate adapter", | ||
| fmt_count(reads_scanned), | ||
| fmt_count(detections), | ||
| ); | ||
| emit_matched_kits_section(reported.iter()); | ||
| info!(" Candidate adapter(s) above --min-fraction:"); | ||
| emit_full_length_rows(reported, detections); | ||
| } |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -n 'env_logger::Builder|Target::Stdout|target\(|info!\(|println!|writeln!\(' src/bin/main.rs src/bin/commands/detect.rsRepository: fulcrumgenomics/chelae
Length of output: 1843
Route the final report to stdout explicitly.
Current info! calls in emit_pe_report and emit_se_report write to stderr due to the env_logger default target. Redirecting stdout misses the report. Use println! or a writeln!(std::io::stdout()) wrapper for the report sections; keep progress logs on info!.
[docs: src/bin/main.rs:49, src/bin/commands/detect.rs:1136-1163]
Current Report Emitter
fn emit_pe_report(...) {
info!("chelae detect complete (paired-end):");
// ...
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/bin/commands/detect.rs` around lines 1123 - 1163, The final
human-readable report emitted by `emit_pe_report` and `emit_se_report` is still
going through `info!`, so it is landing on stderr instead of stdout. Update the
report path in these functions to write explicitly to stdout (for example via a
stdout writer/print-style output), while leaving non-report progress logging on
`info!`. Use the existing symbols `emit_pe_report`, `emit_se_report`,
`emit_matched_kits_section`, and `emit_full_length_rows` to keep the report
assembly unchanged apart from the output target.
- (#1) PE loop: when R1 hits EOF, check whether R2 still has records. If it does, surface an out-of-sync error mirroring the existing "R2 exhausted before R1" path, instead of silently truncating to the shorter file. Added a `pe_r1_short_produces_clear_error` test. - (#3) detect.rs: remove stale `push_tail_kmer` rustdoc that got pasted onto the top of `effective_trimmed_len`'s doc block during the housekeeping pass. - (#4) utils.rs: `Io::new`'s first argument is the gzip compression level, not a thread-pool size. The original trim code threaded `paths.len()` there in the mistaken belief it sized a decompression pool; it doesn't (fgoxide's readers are synchronous and the level is only consulted by writers). Switched to `Io::new(5, BUFFER_SIZE)` matching `chelae trim`'s other use sites, and rewrote the doc comment to describe the real meaning. - (#5) detect.rs: `base_to_slot` now lives inside `impl TailAccumulator`, restoring the "struct immediately followed by its impl block" adjacency rule that the housekeeping pass had broken. Call sites updated to `TailAccumulator::base_to_slot`. Skipped (#2): the `info!`-vs-stdout report-routing finding. The project-wide pattern is that end-of-run summaries go through `info!` (matches `chelae trim`'s summary path); changing detect alone would introduce an inconsistency. Left for a separate decision.
Deep-review pass on the branch surfaced ten fix-in-place items plus
one behavioral question about how detect emits novel adapters that
have a variable region (e.g. i7 barcode past a kit-stable prefix).
Consensus cut (behavioral, was pure majority):
TailAccumulator::consensus now tracks a running-min baseline of the
per-column majority fraction and stops at (a) coverage < min_coverage,
(b) majority < 50%, or (c) a drop of > 10 pp below the baseline.
Downstream aggregate_kmers drops any primary whose consensus can't
make it to TAIL_KMER_LEN — a cluster that heterogeneous inside the
k-mer window shouldn't be reported. Rationale: previously a variable
16 fixed + N variable adapter (imbalanced pool where one index
dominates at 25%) would silently emit that plurality base as
canonical. Discontinuity cut catches the transition cleanly.
Behavioral fixes:
* PE / SE --output-fasta now hard-fails when either mate (PE) or
the whole set (SE) fails to clear --min-fraction, instead of
writing an empty or half-empty FASTA that downstream trim would
silently accept.
* load_adapter_fasta_with_names now preserves headers with empty
bodies (>foo\n>bar\nACGT) and rejects them at validate time, not
silently drops them.
* --output-fasta paths ending in .gz are transparently gzipped via
fgoxide::Io::new_writer instead of writing raw bytes to a .gz
extension.
* detect --adapter-sequence <SEQ> shorter than --adapter-min-length
is rejected in validate() (would have silently never matched).
* --adapter-sequence dropped num_args = 1.. so a stray positional
after -a is no longer greedily consumed as another adapter value.
Doc / API-surface hygiene:
* SE-only flag docstrings now say the flag errors on PE input
instead of "Ignored on PE" (mismatched validate() behavior).
* --quality-trim doc lists 'no' as a disable keyword; README the
same. --quality-trim comparison to trim's default is now
accurately worded ("trim's opt-in --quality-trim-3p default").
* Zero-detection PE error now names --min-tail-length as a remedy
when overlap_hits > 0.
* PE min-tail-length comment now describes what the code actually
does (both mates required; shared detections denominator makes
asymmetric harvesting unsound).
* trim.rs items promoted for detect's use tightened back down:
load_adapter_fasta reverted to private (only _with_names crosses
the module boundary); OverlapAdapterLibrary struct stays
pub(crate) but fields and is_empty() go back to private.
* main.rs comment no longer claims Trim is the only variant.
* utils.rs now imports PathBuf so the fully-qualified signatures
match the rest of the file's style.
* push_tail_kmer drops the redundant tail.to_ascii_uppercase()
(observe already case-folds via base_to_slot).
Tests added:
* consensus_extends_through_steady_high_majority
* consensus_extends_through_gently_declining_majority
* consensus_cuts_at_sharp_discontinuity
* consensus_cuts_at_absolute_floor_when_pool_dominates
* consensus_stops_when_coverage_below_min
* aggregate_kmers_drops_primaries_with_short_consensus
* se_below_detection_floor_errors_instead_of_reporting
* validate_accepts_floor_equal_to_num_detections
* validate_accepts_floor_equal_to_max_reads
* validate_rejects_user_candidate_shorter_than_adapter_min_length
* pe_output_fasta_errors_when_min_fraction_excludes_all_hits
* se_output_fasta_errors_when_min_fraction_excludes_all_hits
* load_adapter_fasta_errors_on_header_without_body
Verification: bash ci/check.sh (fmt + clippy + 276 tests) green;
smoke test on a real PE library (SRR873822) still identifies TruSeq
exact match on both mates with kit-canonical sequences in the FASTA
and gzip-by-extension working.
|
@coderabbitai review |
✅ Action performedReview finished.
|
Summary
Adds a second subcommand,
chelae detect, that identifies which 3' adapter sequence(s) are present in one or two FASTQ files.--adapter-sequence/--adapter-fastacandidate via the same scanner trim uses.chelae trim --adapter-fastaon every sample in a batch.Also fixes
chelae trim --expected-insert-sizeso the hint actually takes effect on the first pair, and lifts shared FASTQ-reader infrastructure intocommands/utils.rs.Notable design decisions
detect_pe_overlapwith an emptyOverlapAdapterLibraryso probe acceptance discovers novel adapters. Low-complexity inserts (poly-A cfDNA, microsatellites) can produce spurious "discovered" tails, but the kit/library noise floor surfaces in the report rather than hiding the real adapter when present.u64counters inTailAccumulator::base_counts(defensive against pathological--num-detections).4:20) vs trim's conservative ones — detect can afford to scrub more because it's looking for the dominant adapter signal across many reads.Review history
This branch went through five review rounds (one independent reviewer plus an opus/sonnet/haiku adversarial pass after the major changes). Findings acted on are described in the individual commit messages; findings ruled out (with rationale) are documented inline:
--adapter-sequenceverbatim (not substituting the kit canonical) — intentional, that's what the user explicitly asked us to score against.insert >= r1_seq.len() || insert >= r2_seq.len()skipping asymmetric-read pairs — matches the docstring contract; sensitivity miss, not correctness bug.aggregate_kmers— sub-second at default settings (N ≤ 5000).Test plan
bash ci/check.sh(fmt + clippy + tests) green on the branchchelae detect -o adapters.fa→chelae trim --adapter-fasta adapters.faon the same library, confirm trim metrics look right